Wrangling and Analyzing WeRateDogs Twitter Data

Data rarely comes in its most usable form. Careful data wrangling and exploratory data analysis are what separate a sound, non-misleading analysis from a simple case of garbage in, garbage out.


Table of Contents


Introduction


In this project, we will be wrangling, analyzing and visualizing the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that does exactly what it says: it rates dogs. And their funny captions are just over the top. All you have to do is send them a serious or funny picture of your dog (or any dog, come to think of it) via direct message, and they'll rate it on a scale of 10. The funny thing, though? The ratings are almost always greater than 10. Why? Because "they're good dogs Brent". WeRateDogs has over 4 million followers and has received international media coverage.

Specifically, we intend to:

Importing Libraries


The following libraries will be useful during wrangling, analysis and visualization:

We will also create a color class to help us pretty-print formatted outputs:
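As a sketch, such a helper can be as simple as a class of ANSI escape codes (the attribute names below are illustrative, not necessarily those used in the notebook):

```python
# Minimal sketch of a pretty-printing helper using ANSI escape codes.
class color:
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'   # reset formatting

# Example usage:
print(color.BOLD + 'Assessment complete' + color.END)
```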

Gathering Data


We will be gathering data from three different sources:

  • Enhanced Twitter archive data, compiled by @dog_rates and shared with Udacity. This archive contains basic tweet data for all 5000+ of their tweets as of August 2017. Udacity provided us with this file, so we will treat it as a file on hand.
  • An image predictions TSV file, compiled by running every image in the WeRateDogs Twitter archive through a neural network that can classify dog breeds. We will download this file programmatically from Udacity's servers using the requests library.
  • Additional data from the Twitter API: we will gather each tweet's retweet count, favorite ("like") count and hashtags from the Twitter API using the Tweepy library.

A. WeRateDogs Twitter Archive

B. Tweet Image Predictions
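The download step can be sketched as follows. The notebook uses the requests library; the stdlib equivalent below keeps the sketch dependency-free, and the URL is a placeholder for the actual Udacity link:

```python
import urllib.request

# Hypothetical URL -- substitute the actual link to the image
# predictions file provided by Udacity.
PREDICTIONS_URL = 'https://example.com/image-predictions.tsv'

def download_file(url, path):
    """Download a file over HTTP and save it locally."""
    with urllib.request.urlopen(url) as response, open(path, 'wb') as f:
        f.write(response.read())
```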

C. Additional Data from Twitter API

Assessing Data


Visual Assessment

A. WeRateDogs Archive

Notes:

Quality Issues

  • Some records are retweets or replies. Some may contain ratings, but they are not the original tweets. The information to identify them can be found in the following columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp.
  • Unexpected ratings in the rating_numerator and rating_denominator columns. Examples are rating numerators as high as 666 and denominators as low as 0.
  • Unusual dog names such as a, an and not in the name column.

Tidiness Issues

  • The various dog stages (doggo, pupper, puppo, and floofer) should be contained in one column.
  • Long and unnecessary links in the source column (the text is embedded in HTML anchor tags). All we need is the actual text.
  • Unwanted columns present: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp.
  • rating_numerator and rating_denominator can be reduced to a single column.

B. Image Predictions

Notes:

Quality Issues

  • The predictions in columns p1, p2 and p3 are not uniformly formatted. Some names are lowercase, some are uppercase and some are title case.
  • The predictions above also have words separated by underscores instead of spaces.

Tidiness Issues

  • From p1, p2 and p3, we only need the most confident prediction that corresponds to an actual dog breed.

C. Tweet JSON Data

Notes:

  • Everything looks fine for now.

Programmatic Assessment

A. WeRateDogs Archive

Notes

  • tweet_id stored as int instead of string/object type.
  • 181 records are retweets and 78 records are replies. We don't need these records in our analysis.
  • timestamp column is stored as string/object type rather than as a Pandas datetime object.
  • The expanded_urls column has some null records.

Let's zoom in on the records where expanded_urls is null:

Tweets with missing expanded_urls are either retweeted posts or replies. We don't need these records in our analysis.

  • We only want the information between the opening and closing anchor tags, signalling the tweet source.

Let's verify if the links in the text and expanded_url columns are different:

After testing each link, one discovers that, in each record, both the text and expanded url links lead to the same tweet.
Some records also have multiple expanded urls separated by commas, all leading to the same tweet. As a result, we can make the following notes:

  • The text column contains both the tweet text and tweet url.
  • The same tweet url is already present in the expanded_urls column
  • Though WeRateDogs posts can have numerators higher than 10, they almost always have denominators of 10. Numerators as high as 1776 and denominators as low as 0 prompt us to inspect the dataframe further:
  • Some ratings were erroneously pulled from the original tweet text, especially when dates (e.g. 24/7 and 9/11) or decimal ratings (e.g. 11.27/10 and 9.75/10) are included in the tweet.
  • Some high ratings appear to be addressed to groups of dogs. For example: 165/150, 84/70, 88/80.
  • Extremely high ratings like 1776/10 and 420/10 are present.

During visual assessment, we identified some unusual dog names like a and an. These names were less than four characters long. We will examine the entire name column for names with fewer than four characters, since we will likely find many invalid names in this group:

  • Again, we notice more unusual names like the, my, by, his, all, mad, life, very, old, this, just etc. All these unusual names are in lowercase, while the more plausible names are properly capitalized.

We can use this criterion to query the entire name column. We will search for records with improperly capitalized names:
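On toy data, the query can be sketched with pandas string methods (the frame below is illustrative):

```python
import pandas as pd

# Toy archive; only the name column matters here.
archive = pd.DataFrame({'name': ['Charlie', 'a', 'the', 'Lucy', 'None']})

# str.islower() is True only for all-lowercase entries, which flags the
# improperly capitalized (and likely invalid) names.
invalid_names = archive[archive['name'].str.islower()]
print(invalid_names['name'].tolist())  # ['a', 'the']
```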

  • None of the improperly capitalized entries in the name column are valid dog names.
  • These entries constitute 109 records in total.
  • Aside from the fact that we have to tidy up these columns into one, everything looks good.

B. Image Predictions

  • There are 2075 records here, 281 fewer than the WeRateDogs archive data.
  • tweet_id is stored with the wrong datatype: it should be a string/object type.
  • We won't be needing the img_num column.
  • Everything looks okay here; confidence levels range from 0 to 1 across all columns.
  • It seems that not all the predictions in our img_predictions dataframe correspond to actual dog breeds.

Let's investigate this case further, and check for situations where none of the predictions detected a dog breed:

  • In about 324 cases, none of the predictions p1, p2 and p3 detected a dog breed.
  • The pulled images show that some of the tweets were not actually about dogs. This may explain why the algorithms didn't detect a dog in the first place.
  • Further examination also shows some instances where the neural network gave false negative responses (a dog was present, but was detected as absent).

C. Tweet JSON Data

  • 2327 records are present, 29 records fewer than the WeRateDogs archive data.
  • The majority of these missing records were caused by Tweepy errors (probably from deleted tweets) during the gathering process.
  • When present, hashtags are stored as lists instead of as Python strings.
  • Some tweets are associated with multiple hashtags.

Finally, let's check for columns that may be common across the three dataframes:

  • The tweet_id column is the only common column across the three datasets.

Assessment Summary


The section below summarizes the findings from both visual and programmatic assessment of the datasets.

Data Quality

A. WeRateDogs Archive:

B. Image Predictions

C. Tweet JSON Data

Data Tidiness

A. WeRateDogs Archive:

B. Image Predictions

C. Tweet JSON Data

Cleaning Data


Before cleaning, we will create individual copies of the three dataframes:

WeRateDogs Archive (Quality Issues)

Define:

  • From archive_clean, drop records where retweeted_status_id or in_reply_to_status_id is not null.

Code:
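A minimal sketch of this filter on toy data (column values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for archive_clean.
archive_clean = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'retweeted_status_id': [np.nan, 8.9e17, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 8.8e17],
})

# Keep only original tweets: both id columns must be null.
original = archive_clean['retweeted_status_id'].isnull() & \
           archive_clean['in_reply_to_status_id'].isnull()
archive_clean = archive_clean[original]
```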

Test:


Define:

  • Convert the tweet_id column to object/string type and the timestamp column to a Pandas datetime object.

Code:
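The conversions can be sketched as follows (the single toy row is illustrative):

```python
import pandas as pd

archive_clean = pd.DataFrame({
    'tweet_id': [892420643555336193],
    'timestamp': ['2017-08-01 16:23:56 +0000'],
})

# Ids are labels, not quantities, so store them as strings;
# timestamps become proper datetime objects.
archive_clean['tweet_id'] = archive_clean['tweet_id'].astype(str)
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])
```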

Test:


Define:

  • Replace unusual names having improper capitalizations with None.

Code:

Test:


Define:

  • Drop the expanded_urls column since the urls are already present in the tweet text.
  • Having multiple links leading to the same page is redundant. We will split the text column into two columns later.

Code:

Test:


Define:

The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs. However, we will:

  • Remove the records with the overly high ratings of 420/10 and 1776/10.
  • Remove the record with rating of 24/7. This is a date, not an actual rating; the right rating is absent from the text.
  • Programmatically extract the right ratings from text to replace the wrong ones.
  • Convert high ratings allocated to dog groups to a scale of 10. This will be done later, when tidying up the dataframe.

Code:
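The extraction step can be sketched with a decimal-aware regular expression (the toy tweets below are illustrative):

```python
import pandas as pd

archive_clean = pd.DataFrame({'text': [
    'This is Bella. She hopes her smile made you smile. 13.5/10',
    'Meet Sam. 12/10 would pet',
]})

# A decimal-aware numerator pattern: a plain \d+/\d+ pattern would
# misread ratings such as 13.5/10.
extracted = archive_clean['text'].str.extract(r'(\d+\.?\d*)/(\d+)')
archive_clean['rating_numerator'] = extracted[0].astype(float)
archive_clean['rating_denominator'] = extracted[1].astype(int)
```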

Test:


WeRateDogs Archive (Tidiness Issues)

Define:

  • Drop the unwanted columns from the archive_clean dataframe.

Code:

Test:


Define:

  • Extract the tweet url into a separate column.
  • Remove tweet urls from the tweet text.

Code:
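A sketch of the split on a toy tweet (the t.co link shown is illustrative):

```python
import pandas as pd

archive_clean = pd.DataFrame({'text': [
    'This is Phineas. 13/10 https://t.co/abcdefg',  # link is illustrative
]})

# Pull the trailing t.co link into its own column, then strip it
# from the tweet text.
archive_clean['tweet_url'] = archive_clean['text'].str.extract(
    r'(https://t\.co/\S+)', expand=False)
archive_clean['text'] = archive_clean['text'].str.replace(
    r'\s*https://t\.co/\S+', '', regex=True)
```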

Test:


Define:

  • Extract the device information from the source column.

Code:
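The anchor-tag extraction can be sketched like this (the source value is a representative example of the archive's format):

```python
import pandas as pd

archive_clean = pd.DataFrame({'source': [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">'
    'Twitter for iPhone</a>',
]})

# Keep only the text between the anchor tags.
archive_clean['source'] = archive_clean['source'].str.extract(
    r'>(.+)<', expand=False)
```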

Test:


Define:

  • Convert all ratings to a denominator scale of 10 using the expression: $\text{rating} = \frac{\text{rating numerator}}{\text{rating denominator}} \times 10$. With this expression, a rating of 120/100 becomes 12/10 and a rating of 55/60 becomes approximately 9.17/10.
  • Once the ratings are standardized, reduce ratings to a single column called rating.
  • Drop the rating_numerator and rating_denominator columns.

Code:
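The standardization can be sketched on toy ratings (rounding to two decimals is an assumption of this sketch):

```python
import pandas as pd

archive_clean = pd.DataFrame({
    'rating_numerator': [120.0, 55.0, 12.0],
    'rating_denominator': [100, 60, 10],
})

# rating = numerator / denominator * 10, rounded to two decimals.
archive_clean['rating'] = (archive_clean['rating_numerator']
                           / archive_clean['rating_denominator']
                           * 10).round(2)
archive_clean = archive_clean.drop(
    columns=['rating_numerator', 'rating_denominator'])
```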

Test:

  • The ratings now appear standardized. However, it seems there are record(s) with ratings of 0. We should investigate this further:

Define:

  • Check and correct for conflicting dog stages, if present.
  • Store all the dog stages in a single column called stage.
  • Drop the columns doggo, pupper, puppo, and floofer.
  • Set the stage column to a categorical type.

Code:
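One way to sketch the consolidation (the toy rows below mirror the archive's 'None' placeholder convention):

```python
import pandas as pd

archive_clean = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
# Concatenate the four columns, strip the 'None' placeholders, and
# restore 'None' where no stage was mentioned at all.
archive_clean['stage'] = (archive_clean[stage_cols]
                          .apply(''.join, axis=1)
                          .str.replace('None', '', regex=False)
                          .replace('', 'None'))
archive_clean = archive_clean.drop(columns=stage_cols)
archive_clean['stage'] = archive_clean['stage'].astype('category')
```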

  • It seems that some records actually contain multiple dog stages. This is hard to read through for now, so we will trim off the extra None characters.
  • 11 records have dogs classified as a mix of doggo and some other stage. To be sure this was not done in error, we can examine the tweet text in detail. These records are few, so we can manually identify and correct them.

After examining the tweet ids and text, and checking the tweet urls, we can observe the following:

  • 785639753186217984 is not about a dog. The tweet is actually about a hedgehog. We will remove this record.

Test:

Note: One more thing! Let's format the dog stage entries to title case, then give doggopupper a more befitting value.

Finally, let's reset the dataframe index and preview our cleaning results:


Image Predictions (Tidiness Issues)

Note: Some of the tidiness issues in img_predictions overlap with quality issues. Solving the tidiness issues first can make dealing with quality issues easier later.

Define:

Part A

  • Iterate through each row of predictions_clean and extract the best prediction and confidence values.
  • Assign these values into new columns named breed and confidence.

Part B

  • Drop all unwanted columns: p1, p2, p3, p1_conf, p2_conf, p3_conf, p1_dog, p2_dog, p3_dog and img_num.

Code A:
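The row-wise selection can be sketched on a two-prediction toy frame (the full data also has p3; the helper name is illustrative):

```python
import pandas as pd

predictions_clean = pd.DataFrame({
    'p1': ['orange', 'golden_retriever'],  'p1_conf': [0.90, 0.80],
    'p1_dog': [False, True],
    'p2': ['pug', 'labrador_retriever'],   'p2_conf': [0.05, 0.10],
    'p2_dog': [True, True],
})

def best_dog_prediction(row):
    """Return the most confident prediction that is an actual dog breed."""
    for p in ('p1', 'p2'):          # the full data also checks p3
        if row[p + '_dog']:
            return pd.Series([row[p], row[p + '_conf']])
    return pd.Series(['Unknown', None])

predictions_clean[['breed', 'confidence']] = predictions_clean.apply(
    best_dog_prediction, axis=1)
```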

Test A:

Code B:

Test B:


Image Predictions (Quality Issues)

Define:

  • Convert tweet_id column to string/object type.

Code:

Test:


Define:

  • Perform cleaning on the breed column, since p1, p2 and p3 have been removed.
  • Replace underscores with spaces and format all entries to titlecase.

Code:
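The reformatting can be sketched on toy breed names:

```python
import pandas as pd

predictions_clean = pd.DataFrame({
    'breed': ['golden_retriever', 'Labrador_retriever', 'pug'],
})

# Replace underscores with spaces, then title-case every entry.
predictions_clean['breed'] = (predictions_clean['breed']
                              .str.replace('_', ' ', regex=False)
                              .str.title())
```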

Test:

Note: We will not clean entries containing dashes, since their use is grammatically correct in this case.


Define:

  • This has been addressed in the process of tidying up the data.
  • Predictions that are not dog breeds have also been assigned a value of Unknown.

Test:


Define:

  • Account for this by merging archive_clean and predictions_clean with an inner join. This way, only records common to both dataframes will be retained.
  • We will do this after cleaning the json_clean dataframe.

Tweet JSON Data (Quality Issues)

Define:

  • Transform each hashtag from the tweet into a distinct record using the df.explode() method.

Code:
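The explode step can be sketched on toy rows (hashtag values are illustrative):

```python
import pandas as pd

json_clean = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'hashtags': [['WomensMarch', 'ScienceMarch'], []],
})

# One row per hashtag; a tweet without hashtags keeps a single NaN row.
json_clean = json_clean.explode('hashtags')
```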

Test:


Tweet JSON Data (Tidiness Issues)

Define:

  • We can resolve this by merging the json_clean dataframe into archive_clean.

Note:

It is better to create a master dataset by merging all the cleaned dataframes with an inner join. In addition to the issue listed above, this merge will also address the pending issue of unequal record counts across the three dataframes.
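A sketch of the merge on toy frames (the columns beyond tweet_id are illustrative):

```python
import pandas as pd

archive_clean = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                              'rating': [12.0, 13.0, 11.0]})
predictions_clean = pd.DataFrame({'tweet_id': ['1', '2'],
                                  'breed': ['Pug', 'Chow']})
json_clean = pd.DataFrame({'tweet_id': ['1', '2'],
                           'retweet_count': [10, 20]})

# Chained inner joins keep only tweet_ids present in all three frames.
master = (archive_clean
          .merge(predictions_clean, on='tweet_id', how='inner')
          .merge(json_clean, on='tweet_id', how='inner'))
```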


Creating a Master Dataset

Notes:

  • We can reorder the columns in a much more intuitive pattern.
  • We should assign more descriptive names to columns like name, breed, stage, and jpg_url.

Let's add these finishing touches to our master dataframe:

Storing Data


Let's store our cleaned master dataframe locally to a file named twitter_archive_master.csv:
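A sketch of the round trip, using an in-memory buffer in place of the actual file:

```python
import io
import pandas as pd

master = pd.DataFrame({'tweet_id': ['892420643555336193'],
                       'rating': [13.0]})

# In the notebook this would be:
#     master.to_csv('twitter_archive_master.csv', index=False)
# A StringIO buffer stands in for the file here.
buffer = io.StringIO()
master.to_csv(buffer, index=False)

# When reloading, pass dtype={'tweet_id': str} so the id column keeps
# its cleaned string type.
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={'tweet_id': str})
```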

Analysis and Visualization


Our analysis will focus on exploring the data to understand the following:

  • User engagement will be assessed in terms of retweets and likes.

1. What are the top breeds in terms of popularity?

Logic:

  • Filter out records for which we don't have a dog breed.
  • Consider the top 20 breeds in terms of the number of tweets posted.
  • Visualize this information using a bar graph.

Action:
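The filtering and counting steps can be sketched on toy data (known_breeds_df follows the notebook's naming; the frame itself is illustrative):

```python
import pandas as pd

master = pd.DataFrame({'breed': ['Golden Retriever', 'Golden Retriever',
                                 'Pug', 'Unknown', 'Chow']})

# Drop unknown breeds, then count tweets per breed, most frequent first.
known_breeds_df = master[master['breed'] != 'Unknown']
top_breeds = known_breeds_df['breed'].value_counts().head(20)
# In the notebook, top_breeds is then plotted as a bar graph.
```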

Insights:

  • The Golden retriever is the most popular breed, with a total count of 155 tweets. The Labrador retriever follows with 103 tweets. In total, the two retrievers (golden and labrador) contributed 258 tweets.
  • Other notable breeds include the Pembroke, Chihuahua and the Pug, occupying the 3rd to 5th place respectively.

2. Which breeds have the highest average ratings?

Logic:

  • Filter out records for which we don't have a dog breed. We already have this information in known_breeds_df.
  • Filter out breeds with fewer than 10 tweets, since we are also considering popularity.
  • Compute average ratings for each breed and select the top 20.
  • Visualize results.

Action:

Insights:

  • The Samoyed, Golden retriever, Great Pyrenees, Pembroke and Chow are the top five breeds in terms of average ratings.
  • 13 of the top rated breeds (13/20) are also present on the most popular list. It appears that these breeds enjoy both high ratings and popularity.

3. What dog stage do users favor the most?

Logic:

  • Identify and filter out records where a dog stage was not mentioned.
  • Isolate only the proper dog stages (doggo, puppo and pupper). The floofer isn't an ideal dog stage since any of the other dog stages can be a floofer too.
  • Compute the average retweets, likes and ratings by dog stage.
  • Remove records with two stages in one tweet, e.g. a doggo and a pupper. It would be hard to tell which one users liked.
  • Melt the resulting dataframe; for ease of plotting with Plotly.
  • Visualize results.

Action:

Insights:

  • Puppos seem to be the people's favorite, leading in average likes (over 20,000) and ratings (12). Doggo tweets also show good engagement in terms of likes (17,412) and ratings (11.7).
  • Doggos enjoy marginally more retweets (5,906) than Puppos (5,708).
  • Puppers trail considerably, gathering the fewest retweets, likes and ratings.

4. Are hashtags associated with higher tweet engagement?

Logic:

  • Identify the number of records that had hashtags included in the tweet. This helps understand if we have enough data to make conclusions.
  • Compute the average retweets, likes and ratings by hashtag use.

Action:

Insights:

  • Very few records, only 1.2% of the total tweets, actually used hashtags.
  • Ratings, retweets and likes seem higher on average when hashtags are used. However, we cannot confidently draw this conclusion, considering that there are far fewer records for tweets with hashtags.

A better alternative to this question is to understand which hashtags generated the highest user engagement (retweets and likes) when used.


5. Which hashtags generated the highest engagements when used?

Logic:

  • Isolate only records where hashtags are used.
  • Aggregate retweets and likes based on each unique hashtag.
  • Visualize the results.

Action:

Insights:

  • #WomensMarch and #ScienceMarch gathered the highest share of interactions. On the other hand, hashtags such as #Swole and #NoDaysOff gained the fewest interactions.
  • The top two hashtags are related to widespread events involving worldwide rallies (#ScienceMarch) and protests (#WomensMarch). This could explain the high number of interactions recorded with their use.

6. How did the number of original posts, retweets and likes vary over the time period?

Logic:

  • Make a copy of the master dataframe and set the timestamp as the new dataframe index.
  • Resample the timestamp by month, aggregating the average number of posts, retweets, and likes in the process.
  • Visualize the results.

Action:

Insights:

  • The total number of original tweets posted on the account has been declining overall. This alone could lead one to believe that the account was gradually becoming unsuccessful over time.
  • The rising trend in retweets and favorites, however, tells a different story: although the total number of tweets has been declining, the account has been gaining a lot of user interaction, moving from fewer than 1,000 retweets and 5,000 likes in late 2015 to over 6,000 retweets and 30,000 likes in late 2017.
  • This could be because, in its earlier stages, an account may create more tweets to gain popularity. As time progresses, people become familiar with the account and start to like and retweet its content for others to see. This can lead to a cycle of success, gradually reducing the need to create numerous posts to drive engagement.
  • There is also an interesting pattern in retweets and likes: they appear to fluctuate in the same direction (when retweets increase, likes increase, and vice versa). We can investigate this further by examining the correlation between the two variables.

7. Is there really an association between retweets and likes?

Logic:

  • Sample 1000 records from the dataframe, then evaluate the relationship between both variables using a scatter plot.

Action:
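On synthetic data, the sampling-and-correlation step looks like this (the engagement numbers are simulated, not the real archive):

```python
import numpy as np
import pandas as pd

# Simulated engagement data standing in for the master dataframe.
rng = np.random.default_rng(42)
retweets = rng.integers(100, 10_000, size=2_000)
likes = retweets * 5 + rng.integers(0, 2_000, size=2_000)
master = pd.DataFrame({'retweets': retweets, 'likes': likes})

# Sample 1000 records and measure the linear association.
sample = master.sample(1000, random_state=0)
corr = sample['retweets'].corr(sample['likes'])
# A scatter plot of sample['retweets'] vs sample['likes'] visualizes this.
```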

Insights:

  • The association between points on the scatter plot above is not very visible. Despite the strong positive correlation, outliers (tweets with a very high number of both retweets and likes) cause the points to be concentrated at the bottom left of the chart.

We can zoom in and examine this relationship better by plotting a scatter plot with the log of both the x (retweets) and y (likes) axes:

Insights:

  • Retweets and likes show strong positive correlation, and this is especially clearer on a log scale.

8. Can we identify the most favored dogs in each dog stage?

Logic:

  • Isolate records for each unique dog stage from the master dataframe.
  • Filter out Floofers (since this isn't an actual dog stage), Doggo with Pupper (it's hard to tell which one people liked in the tweet), and records where a dog stage wasn't specified.
  • Create a leaderboard system that sorts dogs based on tweet engagements (retweets and likes), then ratings.
  • Identify the leading dogs in each group and display their images as visual outputs.

Action:

Insights:

  • The people's favorite Doggo is a Labrador retriever swimming in a pool. We do not know its name, but it has gathered 70,913 retweets, 145,119 likes, and a rating of 13.
  • For the Puppos, it's a Lakeland Terrier. We couldn't get its name, but this puppo participated in the Women's March in Canada, earning itself 40,002 retweets, 124,275 likes, and a rating of 13 in the process.
  • A French bulldog named Jamesy won it all for the Puppers. People seem to love the fact that he gives kisses to other dogs. He earned himself 20,275 retweets, 109,051 likes and a rating of 13 for being so tender.

Conclusion


Real-life data rarely comes clean. In the course of this project, WeRateDogs Twitter data was collected in fragments from different sources. Each piece of data was assessed for quality and tidiness, then cleaned. After wrangling, the datasets were combined into a single dataframe in preparation for further analysis.

Further analysis involved exploring the data and building visualizations. These explorations led to the following insights:

Limitations


References